6/6/2017

Overview

  • Dimension reduction refers to transforming data with many related variables into data with fewer variables for the purpose of visualization and analysis.

  • Common methods we will discuss today:
    • Principal Components Analysis (PCA)
    • Factor Analysis (FA)
    • Multidimensional Scaling (MDS)

Objectives

  • Participants will be familiar with common techniques for dimension reduction
  • Participants will understand the similarities and differences among the presented techniques
  • I will use case studies and examples to introduce methods while keeping mathematical details to a minimum.
  • This workshop is not intended as a guide to implementing these methods in particular software. With that disclaimer, links to data and R scripts for select examples are below:
    • Example XX
    • Example XY

Comparison of methods

  • In a principal components analysis we create a new set of uncorrelated variables which are linear combinations of the original variables. These new variables - the principal components - are chosen sequentially to maximize the variance explained.

  • In a factor analysis we look for a small number of latent factors which explain the correlation structure among the original variables.

  • Multidimensional scaling is a technique for converting a matrix of pairwise distances into a low-dimensional map that preserves the distances as well as possible.

Principal Components Analysis

PCA Toy Example

Diabetes

  • Our first toy example uses data on 145 adults from three groups: controls, 'chemical' diabetics, and overt diabetics.
  • Here is the correlation matrix for the three variables: glucose, insulin, and steady state plasma glucose (sspg).
##         glucose insulin  sspg
## glucose    1.00    0.96 -0.40
## insulin    0.96    1.00 -0.35
## sspg      -0.40   -0.35  1.00

Scatterplot Matrix

  • Here is a scatter plot matrix.

Mathematical Details

  • Consider a data matrix \(D\) whose columns are centered continuous variables.
    Mathematically, PCA solves the following sequence of problems:

\[ w_1 := \arg\max_{||w||=1} \textrm{var}(Dw) = \arg\max_{||w||=1} w'D'Dw \]

\[ w_k := \arg\max_{||w||=1} \textrm{var}(D_kw) \]

where \(D_k := D - \sum_{j<k} Dw_jw_j'\) is the residual relative to the first \(k-1\) components.

  • In practice, this is accomplished using:
    • The eigen-decomposition of the covariance matrix cov(\(D\)) or correlation matrix cor(\(D\))
    • The singular value decomposition of \(D\)
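  • The equivalence of the two routes is easy to check numerically. Below is a minimal illustration in Python/numpy (the workshop's own examples use R; the data here are simulated purely for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))  # simulated correlated data

# Center the columns, then compare the two routes to the same PCA.
Dc = D - D.mean(axis=0)

# Route 1: eigen-decomposition of the covariance matrix.
evals = np.linalg.eigvalsh(np.cov(Dc, rowvar=False))[::-1]  # descending order

# Route 2: singular value decomposition of the centered data.
U, s, Vt = np.linalg.svd(Dc, full_matrices=False)
var_from_svd = s**2 / (Dc.shape[0] - 1)

print(evals)
print(var_from_svd)  # same explained variances from both routes
```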

The Diabetes Toy Example

  • This is the eigen-decomposition of a (symmetric) matrix: \[ \Sigma = \Gamma \Lambda \Gamma' \]
  • The eigenvectors \(\Gamma\) (loadings) serve as weights for creating new orthogonal variables from the original variables: \(D_{new} = D\Gamma\).
  • The eigenvalues for each component are proportional to the explained variance

Explained Variance

  • The eigenvalues for each component are proportional to the explained variance
  • The diabetes correlation matrix has eigenvalues: 2.196, 0.769, 0.035
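  • As a check on these numbers, here is a short Python/numpy sketch (illustrative only; it starts from the rounded correlation matrix printed earlier, so the eigenvalues can differ slightly in the third decimal):

```python
import numpy as np

# Correlation matrix printed above (glucose, insulin, sspg).
R = np.array([[ 1.00,  0.96, -0.40],
              [ 0.96,  1.00, -0.35],
              [-0.40, -0.35,  1.00]])

evals = np.linalg.eigvalsh(R)[::-1]  # descending order
prop_var = evals / evals.sum()       # eigenvalues of a correlation matrix sum to p = 3

print(np.round(evals, 3))     # close to the 2.196, 0.769, 0.035 quoted above
print(np.round(prop_var, 2))  # the first component explains roughly 73% of the variance
```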

Component Loadings

  • Each of the new components is a linear combination of the original variables
  • These are the weights for the first two variables:

Plotting the new variables

  • After dimension reduction, we can create a scatterplot of the new variables.

Biplots

  • You will often see the component loadings and new variables portrayed as a 'biplot'.

Covariance vs Correlation

  • In the diabetes example, we use the eigendecomposition of the correlation matrix because the data have different scales.
  • Working with correlations is equivalent to first transforming all variables into z-scores.
  • This means that each of the original variables is weighted equally.
  • When the original variables are on the same scale, it can make sense to use the eigendecomposition of the covariance matrix so that variables with higher variance are weighted more heavily.
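  • The equivalence between working with correlations and z-scoring first can be verified directly. A small Python/numpy illustration with simulated variables on very different scales:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) * [1.0, 10.0, 100.0]  # simulated data, mixed scales

# z-scoring each column...
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# ...makes the covariance of Z equal the correlation matrix of X,
# so PCA on cor(X) is the same as covariance-based PCA on the z-scores.
print(np.cov(Z, rowvar=False))
print(np.corrcoef(X, rowvar=False))
```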

PCA Example

Epithelial Genes in CTC

  • In this example, the variables are eight genes (measured by qPCR) known to be epithelial markers.
  • The samples are the results of an assay for capturing circulating tumor cells (CTCs) in men with metastatic prostate cancer.
  • The goal of this analysis is to use data from positive and negative controls to create a predictive model for identifying samples with CTCs present.

Epithelial Genes

  • Here are the data as a scatterplot matrix

Choosing not to scale

  • In this example, I chose not to scale the variables since the values were already on a normalized scale.
  • Here are the results, after flipping the sign for both components:

Defining an 'epithelial score'

  • In this case, we can use the first component as an 'epithelial score' with weights:

Defining an 'epithelial score'

  • In the actual application, this score is used to classify which samples contain CTCs.

Factor Analysis

Mathematical Details (1/2)

  • Consider a data matrix \(D\) with \(p\) variables (columns) that have been centered.
  • In a factor analysis, we look for a small number \(k\) of latent factors \(F = [F_1, \dots, F_k]\) such that: \[ D \sim FL' + \epsilon \]
  • The loading matrix \(L\) has \(p\) rows and \(k\) columns. Each row of \(L\) explains how one of the original variables is related to the latent factors.
  • The matrix of factor scores \(F\) has \(n\) rows and \(k\) columns meaning that each observation has a unique set of factor scores.
  • We usually assume the factors \(F\) are uncorrelated and have mean zero, and that \(F\) and \(\epsilon\) are independent.

Mathematical Details (2/2)

  • In a factor analysis, we look for a small number \(k\) of latent factors \(F = [F_1, \dots, F_k]\) such that: \[ D \sim FL' + \epsilon \]

  • Another way to look at this is in terms of covariance matrices, \[ \textrm{cov}(D) = LL' + \Psi \] with \(\Psi\) describing the unique or unexplained variance for each original variable.
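  • The covariance identity can be demonstrated by simulation. In this Python/numpy sketch, the loadings \(L\) and unique variances \(\Psi\) are made-up values chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 100_000, 5, 2

# Made-up loading matrix L (p x k) and unique variances Psi (illustrative values).
L = np.array([[0.9, 0.0],
              [0.8, 0.1],
              [0.1, 0.9],
              [0.0, 0.8],
              [0.5, 0.5]])
psi = np.array([0.2, 0.3, 0.2, 0.4, 0.3])

# Simulate the factor model: each row of D is F L' plus unique noise.
F = rng.normal(size=(n, k))
D = F @ L.T + rng.normal(size=(n, p)) * np.sqrt(psi)

# The sample covariance should be close to L L' + Psi.
implied = L @ L.T + np.diag(psi)
print(np.round(np.cov(D, rowvar=False), 2))
print(implied)
```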

Example: Measures of Glycemic Control

  • For diabetics, blood sugar control has important health consequences.
  • Continuous glucose monitors are instruments that measure someone's glucose level at regular intervals (~5 min).
  • Many 'metrics' for glycemic control have been proposed for mapping this time series to a single summary statistic.

  • The following example works with a collection of metrics computed on baseline data from a JDRF clinical trial (n=443) that have been normalized using Box-Cox transformations.

Glycemic Control Metrics: Correlation

  • A heatmap of the correlation matrix shows there are 2-4 distinct groups of metrics.

Examining the Loadings

  • We will begin with a model using three factors.
  • Here are the loadings (L):
               Factor1 Factor2 Factor3
    DySF        -0.018   0.060   0.784
    Mean         0.920  -0.355   0.114
    SD           0.931   0.208   0.234
    SD.Slope     0.554   0.201   0.781
    Time.Hyper   0.942  -0.283   0.097
    Time.Hypo   -0.076   0.976   0.137
    AUC.Hyper    0.877  -0.057   0.102
    AUC.Hypo     0.126   0.871   0.126
    CONGA.4      0.876   0.252   0.329
    MODD         0.909   0.230   0.238
    ADRR         0.722   0.433   0.468
    HBGI         0.966  -0.197   0.149
    LBGI        -0.134   0.974   0.122
    M.Value      0.296  -0.198  -0.033
    MAGE         0.921   0.233   0.226
    MAG          0.518   0.256   0.772
    GRADE        0.959  -0.110   0.169
  • Recall: Metric \(\sim L_1F_1 + L_2F_2 + L_3F_3\)

Loadings and explained variance

  • For each variable, the squared loadings give the proportion of variance explained by each factor.
               Factor1 Factor2 Factor3
    DySF          0.00    0.00    0.61
    Mean          0.85    0.13    0.01
    SD            0.87    0.04    0.05
    SD.Slope      0.31    0.04    0.61
    Time.Hyper    0.89    0.08    0.01
    Time.Hypo     0.01    0.95    0.02
    AUC.Hyper     0.77    0.00    0.01
    AUC.Hypo      0.02    0.76    0.02
    CONGA.4       0.77    0.06    0.11
    MODD          0.83    0.05    0.06
    ADRR          0.52    0.19    0.22
    HBGI          0.93    0.04    0.02
    LBGI          0.02    0.95    0.01
    M.Value       0.09    0.04    0.00
    MAGE          0.85    0.05    0.05
    MAG           0.27    0.07    0.60
    GRADE         0.92    0.01    0.03

Cumulative variance

  • The overall proportion of variance explained by each factor is simply the average of these squared loadings across variables.
                                Factor1 Factor2 Factor3
    Variance Explained (%)         52.3    20.4    14.4
    Cumulative Variance (%)        52.3    72.7    87.1
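  • These summaries can be reproduced directly from the loadings. A Python/numpy sketch for illustration (the loadings below are copied from the table shown earlier):

```python
import numpy as np

# Loadings from the three-factor model (17 metrics x 3 factors).
L = np.array([
    [-0.018,  0.060,  0.784],  # DySF
    [ 0.920, -0.355,  0.114],  # Mean
    [ 0.931,  0.208,  0.234],  # SD
    [ 0.554,  0.201,  0.781],  # SD.Slope
    [ 0.942, -0.283,  0.097],  # Time.Hyper
    [-0.076,  0.976,  0.137],  # Time.Hypo
    [ 0.877, -0.057,  0.102],  # AUC.Hyper
    [ 0.126,  0.871,  0.126],  # AUC.Hypo
    [ 0.876,  0.252,  0.329],  # CONGA.4
    [ 0.909,  0.230,  0.238],  # MODD
    [ 0.722,  0.433,  0.468],  # ADRR
    [ 0.966, -0.197,  0.149],  # HBGI
    [-0.134,  0.974,  0.122],  # LBGI
    [ 0.296, -0.198, -0.033],  # M.Value
    [ 0.921,  0.233,  0.226],  # MAGE
    [ 0.518,  0.256,  0.772],  # MAG
    [ 0.959, -0.110,  0.169],  # GRADE
])

# Percent variance explained by each factor: mean squared loading x 100.
pct = 100 * (L**2).mean(axis=0)
print(np.round(pct, 1))           # roughly [52.3 20.4 14.4]
print(np.round(pct.cumsum(), 1))  # roughly [52.3 72.7 87.1]
```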

Choosing the number of factors

  • We can use the amount of variance explained by each factor to select how many to retain.
  • There are goodness-of-fit tests as well, but these are more appropriate when designing a single measurement scale.

Choosing the number of factors

  • Adding a third factor increases the cumulative explained variance by about 15%, while a fourth factor adds only about 2% more. Cumulative variance (%) by number of factors retained:
    Factors  Factor 1  Factor 2  Factor 3  Factor 4
    2            56.7      79.2
    3            52.3      72.7      87.1
    4            52.0      72.5      86.4      88.9

Plotting the explained variance

  • A bar chart of the explained variance (squared loadings) allows us to visualize the relations among the variables and latent factors.

Plotting the loadings

  • It is also useful to plot the raw loadings in factor space.

Factor scores

  • The latent factor scores can be used in downstream analyses.

Multidimensional Scaling

MDS

  • Multidimensional scaling is a technique for converting a matrix of pairwise dissimilarities into a low-dimensional map that preserves the distances as well as possible.

  • The first step in any MDS is choosing a dissimilarity measure. Often this will be a metric or distance, e.g.:
    • Euclidean distance for continuous variables
    • Manhattan distance or Jaccard dissimilarity for binary variables
  • Similarity measures can often be converted to dissimilarity measures by inverting \(x \to 1/x\) or subtracting from 1 \(x \to 1-x\).

Mathematical Details (1/2)

  • Consider a matrix of dissimilarities \(D = \{d_{ij}\}_{i,j}\)
  • Metric MDS finds new coordinates \(X = \{(x_{i1}, x_{i2})\}_i\) that minimize the "stress" or "strain", \[ \Big( \sum_{i,j} \big( d_{ij} - ||x_i - x_j|| \big)^2 \Big)^{1/2}. \]

Mathematical Details (2/2)

  • The steps to perform a classical MDS are:
    • Obtain a matrix of squared pairwise dissimilarities
    • Double center this matrix by subtracting the row/column mean from each row/column
    • Compute the eigen-decomposition of the double-centered dissimilarities
  • In metric MDS, non-Euclidean distances are used and the optimization problem is solved with an iterative majorization algorithm.
  • Non-metric MDS can be used when the dissimilarities are obtained directly rather than computed from other variables.
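  • The classical MDS steps above can be sketched in a few lines. This Python/numpy illustration (the workshop's examples use R's cmdscale) checks that distances between known 2-D points are recovered:

```python
import numpy as np

def classical_mds(dist, k=2):
    """Classical MDS: double-center the squared dissimilarities, then eigen-decompose."""
    d2 = dist**2
    n = d2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ d2 @ J                 # double-centered matrix
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1]       # largest eigenvalues first
    evals, evecs = evals[order], evecs[:, order]
    return evecs[:, :k] * np.sqrt(np.maximum(evals[:k], 0))

# Sanity check: Euclidean distances between known 2-D points are recovered exactly
# (up to rotation/reflection of the configuration).
rng = np.random.default_rng(3)
X = rng.normal(size=(8, 2))
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(dist, k=2)
dist_new = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
print(np.max(np.abs(dist - dist_new)))  # near zero
```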

MDS Example 1

Distances between US cities

  • The table below shows distances, in miles, between several major US cities.
                    Atl   Chi   Den   Hou    LA   Mia    NY    SF   Sea    DC
    Atlanta           0   587  1212   701  1936   604   748  2139  2182   543
    Chicago         587     0   920   940  1745  1188   713  1858  1737   597
    Denver         1212   920     0   879   831  1726  1631   949  1021  1494
    Houston         701   940   879     0  1374   968  1420  1645  1891  1220
    LosAngeles     1936  1745   831  1374     0  2339  2451   347   959  2300
    Miami           604  1188  1726   968  2339     0  1092  2594  2734   923
    NewYork         748   713  1631  1420  2451  1092     0  2571  2408   205
    SanFrancisco   2139  1858   949  1645   347  2594  2571     0   678  2442
    Seattle        2182  1737  1021  1891   959  2734  2408   678     0  2329
    Washington.DC   543   597  1494  1220  2300   923   205  2442  2329     0
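  • For illustration, here is the classical MDS recipe applied to this distance table in Python/numpy (a sketch, not the workshop's R code; the city order follows the table):

```python
import numpy as np

cities = ["Atlanta", "Chicago", "Denver", "Houston", "LosAngeles",
          "Miami", "NewYork", "SanFrancisco", "Seattle", "Washington.DC"]
D = np.array([
    [   0,  587, 1212,  701, 1936,  604,  748, 2139, 2182,  543],
    [ 587,    0,  920,  940, 1745, 1188,  713, 1858, 1737,  597],
    [1212,  920,    0,  879,  831, 1726, 1631,  949, 1021, 1494],
    [ 701,  940,  879,    0, 1374,  968, 1420, 1645, 1891, 1220],
    [1936, 1745,  831, 1374,    0, 2339, 2451,  347,  959, 2300],
    [ 604, 1188, 1726,  968, 2339,    0, 1092, 2594, 2734,  923],
    [ 748,  713, 1631, 1420, 2451, 1092,    0, 2571, 2408,  205],
    [2139, 1858,  949, 1645,  347, 2594, 2571,    0,  678, 2442],
    [2182, 1737, 1021, 1891,  959, 2734, 2408,  678,    0, 2329],
    [ 543,  597, 1494, 1220, 2300,  923,  205, 2442, 2329,    0],
])

# Classical MDS: double-center the squared distances, keep the top 2 eigenvectors.
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D**2) @ J
evals, evecs = np.linalg.eigh(B)
order = np.argsort(evals)[::-1]
coords = evecs[:, order[:2]] * np.sqrt(np.maximum(evals[order[:2]], 0))

# The 2-D map should keep nearby cities close: compare NY-DC with NY-LA.
emb = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
print(round(emb[6, 9]), round(emb[6, 4]))
```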

MDS coordinates

  • We can use MDS to obtain a 2-dimensional map preserving distances as well as possible.

Orienting the axes

  • For interpretation, it can be helpful to change the sign on the axes.

Naming the axes

  • You can also aid in interpretation by assigning a name to each axis using subject matter knowledge.
  • I also recommend removing the scales, as only the distances between points are meaningful.

MDS Example 2

Shortstop Defense

  • As an example we will compare the defensive value of MLB shortstops from 2016.
  • We will use a collection of advanced defensive metrics from www.fangraphs.com as our starting data.
##                   rGDP rGFP rPM DRS  DPR RngR ErrR  UZR  Def
## Brandon Crawford     0    3  16  19  1.4 16.2  3.7 21.3 28.0
## Francisco Lindor    -2   -2  21  17 -0.8 18.0  3.6 20.8 27.8
## Freddy Galvis        0    2   3   5  0.7  9.1  5.3 15.1 22.0
## Addison Russell     -1    0  20  19 -1.1 14.5  2.0 15.4 21.9
## Andrelton Simmons    2    0  16  18  0.6 12.9  1.9 15.4 20.8
## Jose Iglesias        2    2  -1   3  1.9  2.6  7.2 11.6 17.6

Choosing a scale

  • All of these metrics are in units of 'runs', but have varying scales so we will work with z-scores.
  • Here is a heatmap of the transformed values:

Computing distances

  • Our first step is computing the Euclidean distance between each pair of players using the z-scores:
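  • This step can be sketched in Python/numpy (the data here are simulated stand-ins for the metrics table, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 4)) * [1, 5, 10, 20]  # simulated metrics on varying scales

# z-score each metric, then take Euclidean distances between players.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
print(np.round(dist, 2))  # symmetric matrix with a zero diagonal
```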

Plotting the new coordinates

  • Given the distances, an MDS algorithm returns a set of coordinates which can be plotted.

Naming the Coordinates (1/2)

  • As before, it is generally helpful to use subject matter knowledge to create names or concepts for each coordinate.

  • Below we look at the correlation of the first coordinate with each of the original variables.

Naming the Coordinates (2/2)

  • In this case, the first coordinate tracks overall defense value which is closely tied to scores based on range.
  • The second coordinate tracks other aspects of value, primarily value from turning double plays.

The End Result

  • Plot aspects such as color, symbol, and marker size can be used with the new coordinate system to help tell a coherent story.

Summary